DOCUMENT RESUME

ED 459 212                                TM 033 511

AUTHOR        Loomis, Susan Cooper
TITLE         Judging Evidence of the Validity of the National Assessment
              of Educational Progress Achievement Levels.
INSTITUTION   ACT, Inc., Iowa City, IA.
SPONS AGENCY  National Assessment Governing Board, Washington, DC.
PUB DATE      2001-06-00
NOTE          40p.; Paper presented at the Annual Meeting of the Council
              of Chief State School Officers (Houston, TX, June 24-27,
              2001).
PUB TYPE      Reports - Descriptive (141) -- Speeches/Meeting Papers (150)
EDRS PRICE    MF01/PC02 Plus Postage.
DESCRIPTORS   *Academic Achievement; *Cutting Scores; Elementary Secondary
              Education; *Evaluation Methods; National Surveys; Standards;
              *Test Results; *Validity
IDENTIFIERS   Competency Tests; *National Assessment of Educational
              Progress; *Standard Setting



ABSTRACT 



This paper describes (1) the procedures developed to set 
achievement levels for the National Assessment of Educational Progress (NAEP) 
that contribute to establishing the validity of the levels and (2) the 
research studies designed to collect information related to the validity of 
the achievement levels and the outcomes of the process. The central issue in 
examining the validity of standards is whether there is evidence of 
procedural validity. The standards must be generally accepted as reasonable 
for the outcomes of the process for setting cutpoints to be valid. For each 
of the three American College Testing (ACT, Inc.) program contracts with the 
National Assessment Governing Board, the process of developing achievement 
levels descriptions has been different, as described, but in all cases there 
has been an effort to solicit broad-based commentary about the reasonableness 
of the achievement level descriptions. The selection of the panelists is 
important, since standard setting panels must be seen as credible. The paper 
describes the selection of panelists, field trials and pilot studies, 
training for facilitators, and panelist training. Several different rating 
methodologies have been evaluated and tested in panel studies for the NAEP
achievement level process, but the modified Angoff method has the most solid
research base in standard setting. Panelists participate in three rounds of
item-by-item ratings with a variety of feedback after each round, completing
evaluations throughout the process. ACT, Inc. has performed various types of
evaluations of the standard setting process and data. These include analyses 
of standard-setting data that are somewhat standard and research studies 
related to validation that are further divided into studies using item 
mapping procedures, studies comparing teachers’ judgments of performance to 
empirical classifications of student performance, and studies comparing 
judgments of performance represented in test booklets to the empirical 
classification of these booklets. In the end, there is no way to know with 
certainty that cutscores are valid, although substantial effort goes into 
ensuring procedural validity. (Contains 21 tables and 50 references.) (SLD) 






Judging Evidence of the Validity of the 
National Assessment of Educational Progress Achievement Levels 



by 



Susan Cooper Loomis 
Senior Research Associate 
Policy Research Department 

ACT, Inc. 






This paper was prepared for presentation at the CCSSO Large-Scale Assessment Conference, 
June 24-27, 2001, Houston. 

The research reported in this paper was supported by contracts with the National Assessment 
Governing Board. 









Introduction 

When training panelists in the standard-setting process and informing them about the purposes of 
setting standards, one is likely to talk about how standard setting helps to answer the question: 
How much is enough? That’s a great question to be able to answer, and panelists seem
immediately to perceive the benefits of setting standards to answer it. 

A slightly different question is often asked about the outcomes of a standard-setting procedure. It 
is also a great question, but a great answer is not so easy to find. How can you know that students 
can do the things that the standards say they should be able to do? This question focuses 
attention on the issue of the validity of the standards. 

The procedures developed to set achievement levels for the National Assessment of Educational 
Progress (NAEP) that contribute to establishing the validity of the levels will be presented in this 
paper. In addition, the research studies designed to collect information related to the validity of 
the achievement levels and the outcomes of the process will be described. 

The procedures described here were developed under three different contracts awarded to ACT 
by the National Assessment Governing Board (NAGB) from 1991 to 2001. These procedures
were developed with the advice of numerous experts. Members of the Technical Advisory 
Committee on Standard Setting are the principal contributors. 1 

Procedural Validity 

Perhaps the central issue in examining the validity of standards is whether there is evidence of 
procedural validity. Kane (2001) points out that “procedural evidence is often considered 
adequate to provide basic support for the performance standards and cutscores unless there is 
conflicting evidence suggesting that the performance standard or cutscore is inappropriate”
(p. 64). He goes on to point out that a standard-setting process would not be judged to be valid if
there were a lack of evidence of procedural validity, but evidence of procedural validity does not 
assure the validity of the process. Procedural validity is a necessary — but not a sufficient — 
condition for validity. 

There are many different sources of guidelines and lists of good practices associated with setting
standards and providing evidence of the validity of the outcomes. Cizek, 1996; Hambleton, 2001;
Kane, 1994; 1995; 2001; Mehrens, 1995; and Mehrens and Cizek, 2001 are but a few. Each of
those authors has contributed significantly to the NAEP ALS process, and much of what is
presented here will have been shaped by their input over the years. Rather than focusing on a
single list, however, this paper will survey relevant procedures and attempt to highlight ways in
which each can be used in establishing the validity of the NAEP ALS Process.

1 Members of the Technical Advisory Committee on Standard Setting (TACSS) include William Brown,
Barbara Dodd, Robert Forsyth, Ronald Hambleton, John Mazzeo, William Mehrens, Jeff Nellhaus, Mark
Reckase, Douglas Rindone, Wim van der Linden, and Rebecca Zwick. Robert Brennan has served on the
ACT Technical Advisory Team (TAT) and as a representative of that team to the TACSS for the entire
period. Reckase also served on TAT before joining the TACSS. Forsyth, Hambleton, Mehrens, Reckase,
and Brennan have been technical advisors for the NAEP Achievement Levels-Setting (ALS) contract since
1991. Michael Kane, Brenda Loyd (deceased), Eugene Johnson, and James Carlson were all former
members of TACSS.

Statements of the Standards 

The statement of the standards must be generally accepted as reasonable in order for the 
outcomes of the process for setting cutpoints to be valid. If the statements of what students 
should know and be able to do are not judged to be reasonable, the cutpoints cannot be judged to 
be reasonable. Thus, having general agreement on the reasonableness of the statements of 
standards is a necessary condition for valid standards. 

The NAEP Achievement Levels Descriptions are the statements of the standards for a subject and 
grade. NAGB determined that there would be three achievement levels or goals for NAEP:
Basic, Proficient, and Advanced. 2 Further, NAGB defined each goal in general terms, and those
are referred to as the policy definitions. The policy definitions are general statements describing 
student performance at each level. There is no mention of grade level and no mention of subject 
matter. These policy definitions serve as the general calibrators for the achievement levels 
descriptions. The achievement levels descriptions are operational definitions of the policy 
definitions, taken from the framework, to describe what students should know and be able to do at
each achievement level for each grade. 

For each of ACT’s three contracts with NAGB, the process of developing achievement levels 
descriptions has been different. In all cases, there has been a genuine effort to solicit broad-based 
commentary regarding the reasonableness of the achievement levels descriptions. The 
differences have been with respect to when the achievement levels descriptions were developed 
and when reviews have occurred. 

In the 1992 NAEP ALS processes implemented to set achievement levels in mathematics, 
writing, and reading, panelists were asked to develop statements of what students should know 
and be able to do in order to meet the general definitions of the three performance levels (Basic, 
Proficient, and Advanced) given by NAGB. Working with the policy definitions, panelists 
engaged in brainstorming sessions to develop operational definitions of performance at each level 
for the subject and grade level of the panel. Those definitions were then evaluated and modified 
by panels of teachers and curriculum specialists who participated in validation studies for each 
subject. The Achievement Levels Definitions (ALDs) were shared with various stakeholder 
groups in a series of public comment forums to collect additional input and recommendations 
before adoption by NAGB. 

Preliminary achievement level definitions were included in the development of the subject 
frameworks for geography, U.S. history, and science — the subjects for which achievement levels 
were set in ACT’s second contract with NAGB, 1993-1997. The preliminary ALDs were
reviewed as part of the framework development review, and that generally includes several
large-scale public comment forums scheduled throughout the nation. During the ALS processes in 1994
and 1996, panelists evaluated the preliminary achievement levels definitions and modified them if 
necessary. In fact, few changes were made to any of the preliminary definitions. Following the 
ALS process, the achievement levels descriptions were included in review packets distributed as 
part of the collection of public commentary on the achievement levels prior to adoption by
NAGB. 3 Public comment was collected in meetings scheduled in Washington, D.C. and by letters
from individuals and representatives of groups, agencies, and organizations.

2 See Public Law 100-297 (1988).

In the first two contracts, panelists developed or modified the ALDs during the ALS process, and
the large-scale review took place after the cutpoints were set. ACT’s 1998 proposal included a
plan for finalizing the achievement levels definitions prior
to convening the achievement levels-setting (ALS) panels. Review panels were convened
throughout the country, and the members of the review panels were selected according to the 
guidelines used for identifying ALS panelists. Their recommendations were collected for 
evaluation and implementation by a small panel of content experts who had developed the subject 
frameworks for NAEP. The review process was iterative. ACT collected public commentary 
from stakeholder groups for additional review and evaluation by the content experts. 
Modifications and reviews continued until general consensus seemed to have been reached. The 
process of finalizing the ALDs was comprehensive and thorough. 4 

There was some concern among ACT staff and TACSS members that panelists would not have 
the same “buy-in” to those definitions — and the process — as they had in previous procedures 
when the panelists were allowed to shape the statements of the standards. There was no evidence 
of this, however. Panelists reported levels of satisfaction with the ALDs and levels of 
understanding their meaning that were equal to those reported by panelists who spent hours 
modifying the definitions (Loomis & Hanick 2000b; 2000c). The 1998 panelists had more time 
to spend on forming a clear concept of borderline performance and on other aspects of the process 
because they were not involved in modifying the definitions. This plan worked very well. 

Having the statements of standards set before the standard-setting panels are convened seems to 
be a worthy goal. If the review process is held after the standard-setting panels are convened, 
then the policy board is responsible for making any adjustments in the cutscores that may seem 
appropriate in light of recommendations collected in the review. If substantive changes to the 
statements are recommended, then the cutpoints may no longer serve as translations of the 
statements to the score scale. In that case, it would be necessary to either change the relationship 
among the framework, statements of standards, and cutpoints or to convene (reconvene) another 
panel of standard setters. Having a thorough review of the statements of the standards prior to 
setting the cutpoints seems the best alternative by far. 

Panelist Selection 

It seems unlikely that education standards will be viewed as credible if the standard-setting panels
are not viewed as credible. Panelists who set standards for public education must be broadly
representative, and they must be qualified. They must understand student behavior and have some
familiarity with the knowledge and skills required of students in the grade level for which they
serve as panelists. These two criteria, broadly representative and well qualified, create a
challenge for forming standard-setting panels.



3 The 1996 Science achievement levels developed through the ACT-NAGB ALS Process were not adopted 
by NAGB. NAGB judged that the outcomes of the process were not reasonable, and they decided on 
different cutpoints for most levels: some higher; some lower. As a result of the changes, new descriptions 
had to be developed for reporting student performance relative to the Science NAEP cutpoints. The 
Science NAEP ALDs are based on the items for which students scoring within the cutpoint ranges had at 
least a 65% average probability of correct response. The descriptions do not necessarily reflect the NAGB 
policy definitions of Basic, Proficient, and Advanced performance for each grade in science. 

4 The process is described and completely documented in Loomis and Hanick, 2000. 



NAGB decided from the beginning that the NAEP ALS process should include representatives of 
the general public as well as educators. They specified that approximately 55% of the panelists be 
classroom teachers in the subject and grade levels for which achievement levels were set; 15% 
other educators — counselors, curriculum directors, higher education faculty in the subject, and so 
forth; and 30% general public. ACT developed guidelines for panelists of each type to help assure 
that there was a reasonable expectation that panelists were qualified to make judgements about 
students in both the subject matter and the grade level for which they would serve as panelists. 
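
As a rough illustration of the arithmetic behind these targets, the sketch below converts the NAGB percentages into whole-number seat counts for a panel of a given size. The function name and the largest-remainder rounding rule are assumptions made for this example; the paper does not say how fractional seats were resolved.

```python
# Hypothetical helper: turn the NAGB target percentages into seat counts.
NAGB_TARGETS = {"teachers": 55, "other_educators": 15, "general_public": 30}

def panel_targets(panel_size, targets=NAGB_TARGETS):
    """Allocate seats by percentage using largest-remainder rounding (an assumption)."""
    if sum(targets.values()) != 100:
        raise ValueError("target percentages must sum to 100")
    # Integer arithmetic avoids floating-point surprises.
    counts = {k: pct * panel_size // 100 for k, pct in targets.items()}
    remainders = {k: pct * panel_size % 100 for k, pct in targets.items()}
    leftover = panel_size - sum(counts.values())
    # Leftover seats go to the largest remainders; ties go to the earlier-listed type.
    for k in sorted(remainders, key=lambda k: -remainders[k])[:leftover]:
        counts[k] += 1
    return counts
```

For a 20-member panel the targets work out exactly (11 teachers, 3 other educators, 6 members of the general public). For a 30-member panel two of the targets are fractional (16.5 and 4.5), so some rounding rule of this kind is needed.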

The process of identifying and selecting NAEP ALS panelists has been well documented and
reviewed (see Raymond & Reid, 2001). ACT included a complete description of the plan in the
Design Document for each of the ALS procedures, and the documents were sent out for public 
review and commentary by stakeholder groups prior to implementation in the first two contracts. 5 

The plan incorporates principles of statistical sampling procedures. A national database of school
districts serves as the primary sampling unit. Nominators are identified in each district, and they 
are invited to submit names of persons who meet the guidelines (distributed to nominators) for 
panelist selection. Nominators must supply information about the candidates that is then used for 
purposes of selecting panel members. 

The best-qualified nominees are given first priority for selection. In addition, panels are
drawn to be representative with respect to panelist type (teachers, other educators, and general 
public — according to NAGB percentage requirements), gender (as nearly equal as possible), 
region (as nearly equal as possible), and race/ethnicity (as nearly proportional to the U.S. 
population as possible). The specific features to be represented on the panels changed somewhat 
over time. Initially, region was a factor included in drawing the sample of districts to identify 
nominators, but it was not a factor for equal representation on the panels in the first contract 
period. The aim for representation by race/ethnicity was to have at least 20% of the panelists 
from minority groups, in general. This changed in the 1998 process so that proportional 
representation by specific racial/ethnic groups (Asian Americans, African Americans, Hispanic 
and Mexican Americans, and Native Americans) was the goal. 

Twenty panelists were recruited for the ALS panels for each subject and grade level in the 1992 
process. The number of panelists was increased to 30 for each in subsequent ALS procedures. 
Pilot study panelists were selected through exactly the same procedures used for the ALS process, 
and pilot studies generally included 20 panelists for each grade in a subject. Panel sizes for field 
trials and validation research studies varied according to the design requirements. As a general 
rule, the aim was to have at least 10 panelists participating in a procedure or research group. 

The method of selecting panelists was submitted to broad-based review and evaluation. The 
process has generally been found to be thorough and effective as a means for selecting 
representative panels of qualified panelists that meet the specified requirements established by 
NAGB and recommended by various stakeholders and the TACSS. 



5 The stakeholder lists were modified for each subject to include key agencies and organizations, but many 
umbrella-like education agencies and organizations were included on each list of approximately 200 
individuals and groups. 






Training 

Field Trials and Pilot Studies 

One full-scale pilot study is now conducted for each subject before the operational ALS is 
implemented. There was only one pilot study for the 1992 ALS cycle. Panelists were selected 
from the St. Louis area, and the selection criteria now in place were not used. 

Twenty panelists were recruited for each of the pilot studies in geography and U.S. history for the 1994
ALS cycle. Those pilot studies were used for collecting research data on several alternative 
methods and procedures. ACT staff recommended against that plan for the next cycle. 

Because the Science NAEP included hands-on tasks, two pilot studies were conducted. The first 
study was to train in the use of hands-on materials, to study the effects of some unusual scoring 
procedures for some constructed response items in science, and to get a sense of the amount of 
time needed to work with the science assessment. The first study included 10 panelists at 
grade 8, 20 at grade 4, and 30 at grade 12. The second pilot study was a full-scale test of 
procedures planned for the operational ALS. One procedure to collect information on item 
mapping criteria was included. 

In 1998, a series of research studies was proposed to prepare for the operational ALS process.
Five different panel meetings were convened before the pilot studies. These field trials were to
test out procedures and collect research information related to procedures planned for the 1998
ALS cycle. The pilot study in each subject was a full-scale practice run for the ALS. A
de-briefing session was also held after each pilot study with a representative sample of panelists at
each grade level. A prepared list of discussion topics was distributed, and panelists were urged to
comment on any additional matters that needed to be discussed. Panelists were aware that there 
would be a de-briefing, and they often shared their comments with persons who had been selected 
to participate in the de-briefing. Several changes in the agenda were made, and some procedures 
were added or modified as a result of the de-briefings. 

Facilitators 

Training is an issue for facilitators, as well as for panelists. Process facilitators are trained 
extensively in a series of half-day and whole-day sessions over a period of several weeks.
Training continues until they are thoroughly familiar with the procedures and with the content of
the materials used in each session. Starting with the 1994 ALS cycle, detailed outlines have been 
provided to the facilitators that describe each step in the process. In addition, 
instructional/training materials presented in the general session are copied and distributed to each 
grade facilitator so that reviews of procedures and instructions are consistent across the grade 
groups. Every effort is made to assure consistency in training across grade groups. As the process 
evolved, it became apparent that process facilitators must be well trained in quantitative 
analysis — ideally, in measurement. Process facilitators must agree to be team players, and they 
must possess a high level of people skills. 

Changes in the content staff were made over time as well. Content facilitators in the first contract 
were generally highly experienced ACT staff working in test development in the specific subject 
area. Key persons from the framework committees in each of those subjects worked with the 
ACT staff to prepare for the task of training panelists in the framework and developing 
achievement levels descriptions. Starting with the second contract, content facilitators have been 
selected from among the persons who served on the framework committees. They have first-hand 
information about decisions regarding the content, format, and organization of the assessment and 
about the development of the achievement levels descriptions. Recommendations from persons 
who worked with these committees are helpful in selecting facilitators who are likely to work
well in the standard-setting context. Most of the content facilitators have been involved in all
stages of the development of the NAEP in a subject before participating in the ALS process. 

A full-day training session including both content and process facilitators is held prior to the pilot
study. Review/retraining sessions of two to three hours are held before the start of the panel
meetings, for both pilot studies and ALS meetings. With only one or two exceptions, all facilitators
participated in the pilot study before the ALS meetings.

It is not possible to monitor activities simultaneously in all grade groups, and it is not desirable to need to
do so. Selecting the right people to serve as facilitators is very important, and providing them
with the necessary training and instructional support is essential to a successful standard-setting 
process. Panelists must feel confident that the facilitators are entirely competent and well trained 
in the process and tasks. 

Panelists 

Standard-setting panelists must be well trained. They must be trained to understand the purposes 
of setting standards, the assessment and scoring protocols, the statements of standards, the rating 
method(s), and the feedback. The credibility and defensibility of the process and the outcomes of 
the standard-setting process are a function of the level of agreement among panelists on the 
meaning of the statements of the standards and on the performance required of borderline 
students. The credibility and defensibility of the process are also a function of the extent to which 
panelists appear to understand the tasks and to feel confident in and satisfied with their ability to 
perform the tasks. This requires training for panelists — especially broadly representative 
panelists. 

The NAEP ALS Process probably provides the most extensive and intensive training for panelists 
of any standard-setting process. A detailed description of the procedures for training panelists 
can be found in Loomis and Hanick (2000b; 2000c). Raymond and Reid (2001) provide a rather 
detailed description and discussion of the NAEP training procedures, and they give the 
procedures a favorable review. In part, the extensive and intensive training is required by the 
combination of types of panelists included in the process. 

Training begins early — shortly after panelists are selected. Three sets of advance materials are 
sent to panelists, along with detailed letters instructing them in the use of the materials and 
preparing them for the process. Advance mailings are spaced at approximately 10-14 day 
intervals. 

Once on site, panelists participate in instructional sessions and hands-on training sessions, and 
they do not start rating items until the afternoon of the third day of the process. During the 
training period, panelists are first given a comprehensive overview of the process and purposes 
for setting NAEP Achievement Levels. The general orientation and overviews are presented by 
both the ACT general facilitator and by the NAGB staff person in charge of achievement levels. 
Panelists take the NAEP under timed conditions and review their work with answer keys and 
scoring rubrics. They spend most of the first 2½ days gaining an understanding of the meaning of
the achievement levels descriptions and forming a clear concept of borderline performance. In 
addition, panelists become familiar with the assessment framework, the item pool, the scoring 
rubrics, and other key features of the assessment. Because most NAEP item pools include a large 
number of constructed response items, a large amount of time is needed to help panelists become 
familiar with the scoring rubrics and to have a clear, realistic understanding of student responses 
to open-ended questions. Panelists are given ample opportunity for discussions with other
panelists in their subject group and for asking questions and seeking clarification from process
and content facilitators. 

All instructions and initial training sessions take place in general sessions including all panelists 
from each grade level. This assures that every panelist hears the same instructions and receives 
the same training in the procedures. Procedures are implemented in grade-group sessions led by 
process facilitators. The amount of time required for each procedure is carefully estimated on the 
basis of data collected through field trials, pilot studies, and previous operational ALS studies.
Too little time to complete a task adds to the frustration of panelists and jeopardizes the
possibility of a valid outcome. Similarly, too much time is likely to lead panelists to redefine the 
task and to have an unintended result. Thus, accurate time estimates are very important to the 
substantive outcome of the process, as well as to the logistic aspects of the process. 

Panelists participate in training that focuses their attention on the achievement levels descriptions 
and borderline performance. Training includes exercises at the individual item level and 
holistically, over a NAEP test booklet. Training materials in exercises do not include items that 
will later be rated by panelists. This means that judgements made in this training period will not 
influence judgements to be made after training is completed and ratings are underway. At the 
same time, however, the training materials allow panelists to become familiar with all the items in 
the NAEP for their grade level. Once panelists are trained for the rating task, they participate in 
the first round of ratings using only the achievement levels descriptions, concepts of borderline 
performance or written borderline descriptors, and the information gathered in training. Once the 
item-by-item rating sessions begin, discussions among panelists are not allowed. 

Rating Methodology 

Several different rating methods have been evaluated and tested in panel studies for the NAEP 
ALS Process. The modified Angoff method, and variants of it, has been used for setting all 
NAEP achievement levels. Because the method for collecting judgments is frequently at the
center of standard-setting decisions, and because the method used for the NAEP ALS Process has
been criticized (National Academy of Education, 1993; National Research Council, 1999), a
description of several methods tested by ACT is presented here, along with the general rationale 
for the acceptance or rejection of the method(s). A more complete review of the various methods 
is presented in Loomis and Bourque, 2001. 

For ACT’s first standard-setting contract, NAGB specified that the modified Angoff method be 
used. In addition, ACT designed a Paper Selection method for use with the constructed response 
items in the 1992 ALS process. The choice of methods was left to ACT in the second contract 
awarded, although NAGB specified that the method selected should have a sound research base 
and should not be likely to produce achievement levels that were greatly different from those 
already set in 1992. 

ACT tested several variants of the modified Angoff method in pilot studies for the 1994 ALS
process. The final decision was to use the modified Angoff method for multiple-choice items and
a mean estimation procedure, quite similar to the modified Angoff method, for constructed
response items. The Technical Advisory Committee on Standard Setting recommended that a
paper selection procedure be implemented prior to the rating process as part of training in 
borderline performance and preparation for rating constructed response items. This was, in part, 
recommended because of the research base established for the paper selection procedure in the 
1992 ALS process. 
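
The basic arithmetic of an Angoff-type rating can be sketched as follows. Each panelist estimates, for each multiple-choice item, the proportion of borderline students who would answer correctly; mean estimation is assumed here to mean that, for a constructed response item, the panelist estimates the mean score borderline students would earn. The sketch sums each panelist's expected item scores and averages across panelists to get a cutscore on a raw-score metric. This is a simplified illustration only: the function name and ratings are hypothetical, and the operational NAEP process places cutpoints on the NAEP reporting scale rather than on a raw-score metric.

```python
def angoff_cutscore(ratings):
    """ratings[p][i]: panelist p's expected score for borderline students on item i.

    For multiple-choice items this is the estimated proportion correct (0 to 1);
    for constructed response items it is the estimated mean item score.
    """
    if not ratings or any(len(r) != len(ratings[0]) for r in ratings):
        raise ValueError("every panelist must rate the same set of items")
    panelist_cuts = [sum(r) for r in ratings]        # expected raw score per panelist
    return sum(panelist_cuts) / len(panelist_cuts)   # average across panelists

# Made-up ratings: 3 panelists, 3 multiple-choice items and one 0-4 point item.
example = [
    [0.60, 0.45, 0.80, 2.0],
    [0.55, 0.50, 0.75, 2.5],
    [0.65, 0.40, 0.85, 1.5],
]
cut = angoff_cutscore(example)
```

In a multi-round process such as the one described in this paper, the same calculation would simply be repeated on each round's ratings.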







In choosing the rating methodology, TACSS has counseled that the rating method used to 
establish cutpoints on the score scale should be compatible with the scaling method used to put 
student scores on the reporting scale. Kane (2001) suggests that the method “be consistent with
the design of the assessment procedure, and both the standard-setting method and the assessment
procedure should be consistent with the conception of achievement underlying the decision
process” (p. 59). For example, a noncompensatory rating method should not be used to set
cutpoints for an assessment using a compensatory scaling model.

Research by ACT has consistently revealed that panelists tend to assign more weight to items
requiring constructed responses than to multiple-choice items (Loomis & Hanick, 2000b; 2000c).
Panelists also tend to select some items that they consider essential indicators of performance at a
particular level of achievement and to weight them disproportionately when judging student
performance holistically. So, no matter how well the student performs on the overall assessment,
the panelist may perceive that the student has a relatively low level of achievement due to the
failure to correctly answer a particular question or set of questions related to a particular area of
knowledge or skill. Given the fact that NAEP uses a compensatory scaling model, an item-by-item
rating method has been favored over holistic methods for the NAEP ALS Process.
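
The distinction can be made concrete with a small sketch. Under a compensatory rule, strong performance on some items offsets weak performance on others, so only the total matters. Under a noncompensatory (conjunctive) rule, missing an item judged essential blocks the classification regardless of the total. The item names, scores, and cutoff below are invented for illustration, and NAEP's actual scaling is based on item response theory rather than a simple sum.

```python
def compensatory(scores, cut):
    """Classify on the total score alone: high scores offset low ones."""
    return sum(scores.values()) >= cut

def conjunctive(scores, cut, essential):
    """Noncompensatory rule: also require credit on every 'essential' item."""
    return compensatory(scores, cut) and all(scores[i] > 0 for i in essential)

# A hypothetical student who misses only item q3.
student = {"q1": 1, "q2": 1, "q3": 0, "q4": 1, "q5": 1}

meets_compensatory = compensatory(student, cut=4)
meets_conjunctive = conjunctive(student, cut=4, essential=["q3"])
```

The same student meets the standard under the compensatory rule but not under the conjunctive one, which is the kind of mismatch between rating method and scaling model that the advice above is meant to avoid.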

The cutpoints should result from the judgments of panelists and not from the methodology. ACT 
experimented with the use of item mapping procedures, similar to the Bookmark methodology. 
One proposed plan was to have panelists use an item-by-item rating process for two (or three) 
rounds and then switch to the item maps for selecting the final cutpoints. This combination of 
methods was tested in a field trial for civics and found to work well (Loomis, Bay, Yang & 
Hanick, 1999; Loomis, Hanick, Bay, & Crouse, 2000a). The need to have a probability value to 
map the items to the scale, however, was judged to be a problem with this method. See Loomis 
and Bourque (2001) for an example and discussion of this problem. A lower probability used in 
mapping items would result in a relatively lower cutpoint and a higher probability value would 
result in a higher cutpoint. 

Once the panelists have made their judgments about where to draw the boundary between two 
achievement levels (where to set the cutpoint), the actual value of the cutpoint on the score scale 
must be determined. That cannot be done without some decision regarding the p-value for
mapping items. There was no established criterion for determining how to map (locate) items on
the score scale. NAGB had established no such criterion and none existed in the research
literature. Indeed, the research conducted to date has revealed no consensus on the choice of
p-values for mapping. 6 In the absence of a policy decision to determine the p-value, TACSS
recommended against further research regarding the use of the item mapping method.
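
The dependence of the cutpoint on the mapping probability can be illustrated with a simple item response model. The sketch below assumes a two-parameter logistic (2PL) model, in which the probability of a correct response at scale location theta is 1 / (1 + exp(-D * a * (theta - b))), and solves for the location at which that probability equals a chosen criterion. The 2PL form, the parameter values, and the function name are assumptions for illustration; they are not the operational NAEP model.

```python
import math

def mapped_location(a, b, rp, D=1.7):
    """Scale location where a 2PL item has probability `rp` of a correct response."""
    if not 0.0 < rp < 1.0:
        raise ValueError("rp must be strictly between 0 and 1")
    return b + math.log(rp / (1.0 - rp)) / (D * a)

# The same hypothetical item (a=1.0, b=0.0) mapped under three criteria.
locations = {rp: mapped_location(1.0, 0.0, rp) for rp in (0.50, 0.65, 0.80)}
```

Because the mapped location rises with the criterion, a boundary drawn at the same item translates into a higher cutpoint under a higher probability value, which is why some decision about the p-value cannot be avoided.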

ACT tested a variant of the modified Angoff method, the yes/no (or correct/incorrect) method
of rating items, in the first set of field trials for the 1998 NAEP ALS Process. Impara and
Plake (1996) had reported that panelists found the method easier to use than the modified Angoff
procedure that required estimates of percentages of borderline students who would correctly
answer questions. Panelists who used the method in the field trials conducted by ACT reported
that it was easy to use and conceptually clear, but the responses of those field trial panelists were
no less positive than those of the panelists who had estimated percentages in other ALS studies (Loomis,
Bay, Yang, and Hanick, 1999; and Loomis, Hanick, Bay, & Crouse, 2000a; 2000b). Further
analyses by ACT (Reckase, 1998; Reckase & Bay, 1999) revealed that the Item Score String 



6 ACT conducted several research studies as part of the 1996 Science NAEP ALS Process (ACT, 1997). 
Zwick et al. (2000) examined this issue with respect to the choice of p-value to use in selecting exemplar
items. Kolstad has researched the issue extensively as well. See Kolstad (1996) and Kolstad et al. (1998).







